[BUG] Fix actor pool project splitting when column is not renamed #2998

Merged: kevinzwang merged 5 commits into main from kevin/split-actor-pool-alias on Oct 5, 2024

kevinzwang

Previously, this would fail:

```py
import os

os.environ["DAFT_ENABLE_ACTOR_POOL_PROJECTIONS"] = "1"

import daft
from daft import udf

@udf(
    return_dtype=daft.DataType.int64(),
    batch_size=1
)
class MyUDF:
    def __init__(self):
        # import time
        # time.sleep(10)
        pass

    def __call__(self, _):
        # import time
        # time.sleep(10)

        import os

        pid = os.getpid()
        return [pid]

MyUDF = MyUDF.with_concurrency(4)

df = daft.from_pydict({"a": list(range(10))})
df = df.into_partitions(4)
df = df.select(MyUDF(df["a"]))
df = df.select(MyUDF(df["a"]))
df.show()
```

This fails because, when we split the project into multiple actor pool projects, we create new names for the intermediate columns and lose the original column name. This PR fixes that by adding an alias at the end of the actor pool projects, so the output keeps its original name.
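
To illustrate the naming problem, here is a minimal pure-Python sketch. It does not use Daft or mirror the actual optimizer rule: `split_into_stages`, `execute`, and the `__intermediate_N__` naming scheme are hypothetical stand-ins, and the table is a plain dict. Each stage reads one column and writes a generated name, so without a final alias the output column no longer carries the name `a`.

```python
Table = dict[str, list[int]]


def split_into_stages(column: str, num_udf_calls: int) -> list[tuple[str, str]]:
    """Return one (input_name, output_name) pair per UDF call."""
    stages = []
    source = column
    for i in range(num_udf_calls):
        generated = f"__intermediate_{i}__"  # hypothetical generated name
        stages.append((source, generated))
        source = generated
    return stages


def execute(table: Table, stages, udf, final_alias=None) -> Table:
    """Run the stages in order; each stage keeps only the column it produces."""
    for source, target in stages:
        table = {target: [udf(v) for v in table[source]]}
    if final_alias is not None:
        # Rename the single remaining column back to the requested name.
        ((_, values),) = table.items()
        table = {final_alias: values}
    return table


stages = split_into_stages("a", 2)
add_one = lambda v: v + 1

without_alias = execute({"a": [0, 1, 2]}, stages, add_one)
with_alias = execute({"a": [0, 1, 2]}, stages, add_one, final_alias="a")

print(list(without_alias))  # ['__intermediate_1__']
print(with_alias)           # {'a': [2, 3, 4]}
```

In this toy model, anything downstream that looks up `a` only works in the aliased case, which is the behavior the fix is meant to restore.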

@kevinzwang kevinzwang requested a review from jaychia October 4, 2024 20:21
@github-actions github-actions bot added the bug Something isn't working label Oct 4, 2024
codspeed-hq bot commented Oct 4, 2024

CodSpeed Performance Report

Merging #2998 will not alter performance

Comparing kevin/split-actor-pool-alias (e558ac7) with main (cd59c73)

Summary

✅ 17 untouched benchmarks

codecov bot commented Oct 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.12%. Comparing base (a62d276) to head (e558ac7).
Report is 2 commits behind head on main.

@@            Coverage Diff             @@
##             main    #2998      +/-   ##
==========================================
+ Coverage   77.80%   78.12%   +0.31%     
==========================================
  Files         602      602              
  Lines       71892    71461     -431     
==========================================
- Hits        55938    55830     -108     
+ Misses      15954    15631     -323     

| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...al_optimization/rules/split_actor_pool_projects.rs | 95.15% <100.00%> (+0.39%) | ⬆️ |

... and 18 files with indirect coverage changes
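
As a quick sanity check on the report, the percentages can be recomputed from the hit and line counts in the diff. This is only a sketch: `coverage_pct` and `truncate_2dp` are hypothetical helpers, and truncating to two decimals is an assumption inferred from these figures, not documented Codecov behavior.

```python
from fractions import Fraction


def coverage_pct(hits: int, lines: int) -> Fraction:
    """Coverage as an exact percentage: hits / lines * 100."""
    return Fraction(hits, lines) * 100


def truncate_2dp(value: Fraction) -> str:
    """Format a non-negative value with two decimals, truncating (assumed rule)."""
    hundredths = int(value * 100)  # int() drops the fractional part
    return f"{hundredths // 100}.{hundredths % 100:02d}"


base = coverage_pct(55938, 71892)  # main
head = coverage_pct(55830, 71461)  # #2998

print(truncate_2dp(base), truncate_2dp(head), truncate_2dp(head - base))
# 77.80 78.12 0.31
```

The recomputed values agree with the 77.80%, 78.12%, and +0.31% shown in the diff.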


@jaychia jaychia left a comment

Awesome

recursive_count: usize,
) -> DaftResult<Transformed<Arc<LogicalPlan>>> {
// TODO: eliminate the need for recursive calls by doing a post-order traversal of the plan tree.

Excellent :)
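
For context on the TODO, one reading is that a bottom-up (post-order) pass rewrites every child before its parent, so a rule would not need to call itself on the nodes it produces. The sketch below shows only the traversal order in plain Python; `Node` and `transform_up` are hypothetical names and not Daft's plan API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)


def transform_up(node: Node, rule) -> Node:
    """Apply `rule` to the children first, then to the node itself (post-order)."""
    new_children = [transform_up(child, rule) for child in node.children]
    return rule(Node(node.name, new_children))


visited = []


def record(node: Node) -> Node:
    """A no-op rule that only records the order in which nodes are visited."""
    visited.append(node.name)
    return node


plan = Node("project", [Node("filter", [Node("source")])])
transform_up(plan, record)
print(visited)  # ['source', 'filter', 'project']
```

Because each node is visited after its inputs, a rule applied this way always sees children that have already been rewritten.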

Co-authored-by: Jay Chia <[email protected]>
@kevinzwang kevinzwang enabled auto-merge (squash) October 5, 2024 01:47
@kevinzwang kevinzwang merged commit 53a84ea into main Oct 5, 2024
40 checks passed
@kevinzwang kevinzwang deleted the kevin/split-actor-pool-alias branch October 5, 2024 03:17
sagiahrac pushed a commit to sagiahrac/Daft that referenced this pull request Oct 7, 2024
…entual-Inc#2998)
